
5 Reasons Parquet Files Are Better Than CSV for Data Analyses | PyData Global 2021

Duration: 29:29

Views: 1.4K

Likes: 51

Date Created: Jan, 2022

Channel: PyData

Category: Science & Technology

Tags: python, learn to code, education, software, pydata, learn, coding, how to program, julia, opensource, scientific programming, numfocus, python 3, tutorial

Description: 5 Reasons Parquet Files Are Better Than CSV for Data Analyses

Speaker: Matthew Powers

Summary

Parquet files are well supported by most languages / libraries, are easier to work with, and are typically more performant than CSV files. This talk summarizes the main benefits of Parquet files and uses benchmarking analyses to show how they’re faster. You’ll also learn how to convert CSV files to Parquet.

Description

5 reasons Parquet files are better than CSV:

schema - examine how the schema is embedded in the file metadata, leveraging PyArrow
file sizes - compare file sizes when identical data is written to CSV and Parquet
columnar file format - examine the performance benefits of using column pruning to skip data
predicate pushdown filtering - understand how to query row group metadata with PyArrow and how to skip entire row groups based on column metadata
immutable - why immutable file formats are better

How to convert CSV files to Parquet with Pandas, Dask, and PySpark. Will show how to convert a single file or multiple files in parallel.

When to use CSV files and when to avoid them.

Matthew Powers's Bio

Powers is a tech evangelist at Coiled. He used Spark / PySpark for 6 years and now helps devs understand when Dask is a better fit. He's written two books, has a popular blog, and regularly contributes to open source codebases. In a past life, he passed all three CFA exams and worked in finance.

GitHub: github.com/MrPowers
Twitter: twitter.com/_mrpowers_
Website: mungingdata.com

PyData Global 2021

Website: pydata.org/global2021
LinkedIn: linkedin.com/company/pydata-global
Twitter: twitter.com/PyData

pydata.org

PyData is an educational program of NumFOCUS, a 501(c)3 non-profit organization in the United States. PyData provides a forum for the international community of users and developers of data analysis tools to share ideas and learn from each other.
The global PyData network promotes discussion of best practices, new approaches, and emerging technologies for data management, processing, analytics, and visualization. PyData communities approach data science using many languages, including (but not limited to) Python, Julia, and R. PyData conferences aim to be accessible and community-driven, with novice to advanced level presentations. PyData tutorials and talks bring attendees the latest project features along with cutting-edge use cases.

00:00 Welcome!
00:10 Help us add time stamps or captions to this video! See the description for details.

Want to help add timestamps to our YouTube videos to help with discoverability? Find out more here: github.com/numfocus/YouTubeVideoTimestamps
